Abstract

Sparse model structure has become a basic assumption for all existing variable selection methods. However, this assumption is questionable: it rarely holds exactly in practice, and none of the existing methods provides consistent estimation and accurate model prediction in non-sparse scenarios. In this paper, we propose semiparametric re-modeling and inference when the linear regression model under study is possibly non-sparse. After an initial working model is selected by a method such as the Dantzig selector adopted in this paper, we re-construct a globally unbiased semiparametric model by use of suitable instrumental variables and nonparametric adjustment. The newly defined model is identifiable, and the estimator of the parameter vector is asymptotically normal. The consistency, together with the re-built model, improves model prediction. The method also works when the model is indeed sparse and is thus robust against non-sparseness in a certain sense. Simulation studies show that, particularly when p is much larger than n, the new approach significantly improves estimation and prediction accuracy over the Gaussian Dantzig selector and other classical methods. Even when the model under study is sparse, our method is comparable to the existing methods designed for sparse models.
1. Introduction

In this paper we consider the linear model $Y = \beta^T X + \varepsilon$, a full model that contains all possibly relevant predictors $X_1, \ldots, X_p$ in the predictor vector $X$. Here the dimension $p$ of $X$ is large and may even exceed the sample size $n$. Since in many cases most of the predictors are insignificant in a certain sense for the response $Y$, variable selection is necessary. Although this topic has been intensively investigated in the literature, the following issues have not yet received enough attention.

• The success of almost all existing variable/feature selection methodologies critically hinges on sparse model structure. The resulting working model, which contains the "significant" predictors, is still assumed to be a linear model with the same structure as the full model. This holds only when the full model is exactly sparse. In most cases, however, the full model may not be exactly sparse, and then even model identifiability becomes an issue. More precisely, after model selection the resulting working model is usually biased, because the accumulated bias caused by excluding too many "insignificant" predictors is non-negligible even when every coefficient associated with an "insignificant" predictor is very small. It is therefore necessary to refine the working model so that it becomes unbiased and identifiable; otherwise the estimator based on it cannot be consistent and the prediction will not be accurate. It is worth pointing out that the refined working model need not have the same structure as the original full model unless the full model is sparse. To the best of our knowledge, no existing research handles these issues. In this paper, we propose a method to reconstruct the working model, define consistent estimators of the coefficients associated with the significant predictors contained in the selected model, and further improve prediction accuracy.
In this paper, "non-sparsity" means that only a few regression coefficients are large and the rest are small but not necessarily zero. A detailed definition of "non-sparsity" will be given in Section 3 for model identification. Furthermore, it is known that checking either sparsity or non-sparsity of a high-dimensional model is a hard task. When there is no prior information on sparsity, employing a non-sparse model is a robust and conservative way to avoid modelling risk. Of course, when the model under study is really sparse, one hopes that the new method also works.

It is noted that Zhang and Huang (2008) also investigated a model in which only a few regression coefficients are large and the rest are small, although they still called it a sparse model. Their paper investigated rate consistency, which means that the number of selected variables is of the same order as the number of variables with large coefficients in an asymptotic sense. This consistency does not imply the conventional estimation consistency and does not improve prediction accuracy: in the scenario they investigated, estimation consistency and prediction accuracy were not discussed and remain open challenges. In our paper, by re-modeling the selected working model obtained from the full model, estimation consistency can be achieved and model prediction accuracy can be improved.

For sparse models there are a great number of research works in the literature. 
We list a few here. The LASSO and the adaptive LASSO (Tibshirani 1996; Zou 
2006), the SCAD (Fan and Li 2001; Fan and Peng 2004), the Dantzig selector 
(Candes and Tao 2007) and MCP (Zhang 2010) can be used to provide consistent 
and asymptotically normally distributed estimation for the parameters in selected 
working models. In practice, there are no approaches to check sparsity before using 
them. 

To motivate our method, we focus mainly on the Dantzig selector. The Dantzig selector has received much attention, and an asymptotic equivalence between the Dantzig selector and the LASSO in certain senses was discovered by James et al. (2009). Under the uniform uncertainty principle, the resulting estimator achieves an ideal risk of order $\sigma\sqrt{\log p}$ with large probability. This implies that for large $p$ such a risk can nevertheless be large, so that even under sparse structure the estimator may be inconsistent. To reduce the risk and improve the performance of the estimation, the Gaussian Dantzig Selector, a two-stage estimation method, was suggested in the literature (Candes and Tao 2007). The corresponding estimator is still inconsistent when the model is non-sparse (for details see the next section). Another method is the Double Dantzig Selector (James and Radchenko 2009), by which one may choose a more accurate model and, at the same time, get a more accurate estimator. But it still critically depends on the choice of the shrinkage tuning parameter and on the sparsity condition. Taking these problems into account, Fan and Lv (2008) introduced a sure independence screening method based on correlation learning to reduce high dimensionality to a moderate scale below the sample size. Afterwards, variable selection and parameter estimation can be accomplished by a sophisticated method such as the LASSO, the SCAD or the Dantzig selector. Relevant references include Kosorok and Ma (2007), Chen and Qin (2009), James, Radchenko and Lv (2009) and Kuelbs and Anand (2009), among others. However, when the model is non-sparse and the dimension $p$ of the predictor vector is very large, the model is not identifiable, and estimation consistency is usually very difficult or even impossible to achieve with existing methods. As a result, model prediction is less accurate and further data analysis is not reliable unless the bias can be corrected.

Thus, for a non-sparse model, we have no reason to expect an unbiased working model that has the same form as the full model when only a small portion of the predictors are regarded as significant and selected into the working model. Bias correction is necessary. In this paper, we focus on the working sub-model chosen by the Dantzig selector. For the full model, we suggest an identifiability condition and a re-modeling method to identify a working model, and then construct a consistent and asymptotically normal estimator of the coefficient vector in the working sub-model. To achieve this, an adjustment is recommended to construct a globally identifiable and unbiased semiparametric model. The adjustment depends only on a low-dimensional nonparametric estimation that uses proper instrumental variables. The resulting estimator $\hat\theta$ of the parameter vector $\theta$ in the sub-model satisfies $\|\hat\theta - \theta\|_{\ell_2}^2 = O_p(n^{-1})$ and is asymptotically normal if the dimension $q$ of $\theta$ converges to a fixed constant with probability tending to one. Furthermore, the new consistent estimators, together with the unbiased adjusted sub-model or the original sub-model defined in this paper, can also improve model prediction accuracy. This is a first attempt in this area to understand modeling after variable selection when sparse structure is not imposed. It is worth mentioning that although insignificant predictors are ruled out in the selection step, we do not abandon them entirely; instead we use them to construct adjustment variables.

It is worth pointing out that the newly proposed method is general and may also be applied with other variable selection approaches. On the other hand, the robustness against non-sparseness comes at a cost: the new algorithm is slightly more complicated to implement than existing methods, because a linear model is transformed into a nonlinear one. However, this cost is worth paying to avoid unreliable further analysis caused by inconsistent estimation and to obtain more accurate prediction.

The rest of the paper is organized as follows. In Section 2 the properties of the Dantzig estimator for the high-dimensional linear model are reviewed. In Section 3, an identifiability condition is assumed, a bias-corrected sub-model is proposed by introducing instrumental variables, and a nonparametric adjustment and a method for selecting instrumental variables are suggested. Estimation and prediction procedures for the new sub-model are given, and the asymptotic properties of the resulting estimator and prediction are obtained. In Section 4 an approximate algorithm for constructing instrumental variables is proposed for the case when the dimension of the related nonparametric estimation is relatively large. Simulation studies are presented in Section 5 to examine the performance of the new approach in comparison with the classical Dantzig selector and other methods. The technical proofs of the theoretical results are provided in the online supplement to this article.

2. A brief review of the Dantzig selector

Recall the full model:

$$Y = \beta^T X + \varepsilon, \tag{2.1}$$

where $Y$ is the scalar response, $X$ is the $p$-dimensional predictor and $\varepsilon$ is the random error satisfying $E(\varepsilon|X) = 0$ and $\mathrm{Cov}(\varepsilon|X) = \sigma^2$. Throughout this paper, the primary interest is to build a valid sub-model of (2.1) whose size goes to a non-random number with probability tending to one. Non-randomness of the selected sub-model is needed for model identifiability. We then build an adjusted model that is unbiased and identifiable. The second interest of our paper is to construct consistent estimators of the coefficients of the significant predictors in the rebuilt model and, further, to obtain reasonable model prediction via our estimation and the selected sub-model or the adjusted model.

To introduce a re-modeling method and a novel estimation approach, we first re-examine the Dantzig selector. Let $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ be the vector of the observed responses and $\mathbf{X} = (X_1, \ldots, X_n)^T = (x_1, \ldots, x_p)$ be the $n \times p$ matrix of the observed predictors. The Dantzig selector of $\beta$ is defined as

$$\hat\beta^D = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_{\ell_1} \quad \text{subject to} \quad \sup_{1 \le j \le p} |x_j^T r| \le \lambda_p \sigma \tag{2.2}$$

for some $\lambda_p > 0$, where $\|\beta\|_{\ell_1} = \sum_{j=1}^p |\beta_j|$ and $r = \mathbf{Y} - \mathbf{X}\beta$. As was shown by Candes and Tao (2007), under the sparsity assumption and other regularity conditions, this estimator satisfies, with large probability,

$$\|\hat\beta^D - \beta\|_{\ell_2}^2 \le C \sigma^2 \log p, \tag{2.3}$$

where $C$ is free of $p$ and $\|\hat\beta^D - \beta\|_{\ell_2}^2 = \sum_{j=1}^p (\hat\beta_j^D - \beta_j)^2$. In fact this is an ideal risk and thus cannot be improved in a certain sense. However, such a risk can become large and may not be negligible when the dimension $p > n$. Moreover, without the sparsity condition, the risk will be even larger than that given in (2.3).

To reduce the risk and improve the performance of the Dantzig selector, one often uses a two-stage selection procedure (e.g., the Gaussian Dantzig Selector) to construct a risk-reduced estimator for the obtained sub-model (Candes and Tao 2007). For example, we can first estimate $I = \{j : \beta_j \neq 0\}$ by $\hat I = \{j : |\hat\beta_j^D| > k\sigma\}$ for some $k > 0$ and then construct the estimator

$$\hat\beta_{\hat I} = \big((\mathbf{X}^{(\hat I)})^T \mathbf{X}^{(\hat I)}\big)^{-1} (\mathbf{X}^{(\hat I)})^T \mathbf{Y}$$

for $\beta_I$ and shrink the other components of $\beta$ to zero, where $\beta_I$ is the restriction of $\beta$ to the set $I$, and $\mathbf{X}^{(I)}$ is the matrix formed by the column vectors of $\mathbf{X}$ indexed by $I$.
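
The two-stage idea can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `two_stage_refit` and its arguments are hypothetical, `beta_init` stands for any initial estimate (for example a Dantzig selector solution computed elsewhere), and the cut-off $k\sigma$ is passed as `threshold`.

```python
import numpy as np

def two_stage_refit(X, y, beta_init, threshold):
    """Threshold an initial estimate, then refit least squares on the kept columns."""
    # Stage 1: keep the indices whose initial coefficient exceeds the cut-off.
    selected = np.flatnonzero(np.abs(beta_init) > threshold)
    # Stage 2: ordinary least squares on the selected columns, zeros elsewhere.
    beta = np.zeros(X.shape[1])
    if selected.size:
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta[selected] = coef
    return selected, beta
```

The refit removes the shrinkage of the first stage on the selected coefficients, but it inherits any bias of the selected sub-model itself.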

When the model is not sparse, the set $I$ is very large and there is no method available in the literature to estimate $\beta_I$ consistently. However, for variable/feature selection, we are mainly interested in those significant variables that are associated with large coefficient values. Thus, denote $\beta_I = \theta$, a $q$-dimensional vector of interest. To identify the set $I$, we will give an identifiability condition ensuring that the random set $\hat I$ converges to $I$ with probability tending to one. For ease of description, we temporarily assume that $\hat I$ is fixed. Without loss of generality, suppose that $\beta$ can be partitioned as $\beta = (\theta^T, \gamma^T)^T$ and, correspondingly, $X$ is partitioned as $X = (Z^T, U^T)^T$. Then the above two-stage procedure implies that, based on the Dantzig selector, we use the sub-model

$$Y = \theta^T Z + \eta \tag{2.4}$$

to replace the full model (2.1), where $\eta = \gamma^T U + \varepsilon$ is regarded as the error. Here the dimension $q$ of $\theta$ can be either fixed or diverging with $n$ at a certain rate. Since the above sub-model replaces the full model (2.1), we call $\theta$ and $Z$ the main parts of $\beta$ and $X$, respectively. From (2.1) and (2.4) it follows that $E(\eta|Z) = \gamma^T E(U|Z)$. When both $\gamma \neq 0$ and $E(U|Z) \neq 0$, the sub-model (2.4) is biased and thus the two-stage estimator $\hat\theta_S = \hat\beta_{\hat I}$ is also biased. Consequently, the two-stage estimator $\hat\theta_S$ of $\theta$ is also inconsistent. Note that for any non-sparse model, $\gamma \neq 0$ always holds. As such, the above classical method cannot provide consistent estimation.
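
A small population-level calculation illustrates how quickly this bias accumulates. The sketch below assumes, purely for illustration, centered predictors for which $E(U|Z)$ is linear in $Z$, so that the least squares limit in (2.4) is $\theta + \Sigma_{ZZ}^{-1}\Sigma_{ZU}\gamma$; the function name `submodel_bias` and all numerical values are hypothetical.

```python
import numpy as np

def submodel_bias(sigma_zz, sigma_zu, gamma):
    """Asymptotic bias of least squares for theta when U is dropped.

    Assumes centered predictors with E(U | Z) linear in Z, so the bias is
    Sigma_ZZ^{-1} Sigma_ZU gamma.
    """
    return np.linalg.solve(sigma_zz, sigma_zu @ gamma)

# One selected predictor Z and 1000 excluded predictors, with all pairwise
# correlations equal to 0.3 (illustrative values only).
n_excluded = 1000
sigma_zz = np.array([[1.0]])
sigma_zu = np.full((1, n_excluded), 0.3)
gamma = np.full(n_excluded, 0.01)  # every excluded coefficient is small
bias = submodel_bias(sigma_zz, sigma_zu, gamma)
```

In this configuration each excluded coefficient is only 0.01, yet the bias in the coefficient of $Z$ is about 3, because the small contributions add up over the excluded predictors.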

An improved Dantzig selector is the Double Dantzig Selector (James and Radchenko 2009), by which a more accurate model and estimation can be expected. In the first step, the Dantzig selector is used with a relatively large shrinkage tuning parameter $\lambda_p$ defined above to get a relatively accurate sub-model in the sense that fewer insignificant predictors are contained. The Dantzig selector is then applied again within the selected sub-model, with a small $\lambda_p$ and the data (Y, Z), to obtain a relatively accurate estimator of $\theta$. However, such a method cannot handle non-sparse models either, because the sub-model selected in the first step is already biased. It is also noted that this method critically depends on two choices of the shrinkage tuning parameter $\lambda_p$; for details see James and Radchenko (2009). Moreover, when the estimation consistency and asymptotic normality, rather than only variable selection, depend heavily on the choice of $\lambda_p$, the method is not convenient in practice and, more seriously, the consistency cannot in effect be verified unless a criterion for tuning parameter selection can be defined to ensure it.
3. Re-modeling and inference

As was shown above, the sub-model (2.4) is usually biased and random after the variable selection determined by the Dantzig selector. Here randomness of the model means that the estimate $\hat I$ of the index set $I$ defined in the previous section is random. As this section is long and contains the main contributions, we separate it into several subsections. Subsection 3.1 proposes an identifiability condition for non-sparse models; subsection 3.2 investigates a re-modeling scheme; the estimation procedure is described in subsection 3.3. To highlight the procedure, a short subsection 3.4 summarizes the steps of the algorithm. The asymptotic behaviours are given in subsection 3.5. Subsection 3.6 discusses the prediction issue.

3.1 Identifiability condition. Before re-modeling and inference, we first assume a condition to guarantee that the working sub-model (2.4) is identifiable with probability approaching one. Let $|J|$ be the number of elements in an index set $J \subset \{1, 2, \ldots, p\}$ and $\bar J$ be the complement of $J$ in the set $\{1, 2, \ldots, p\}$. For a $p$-dimensional vector $l = (l_1, \ldots, l_p)^T$, denote by $l_J = (l_j)_{j \in J}$ the subvector whose entries are those of $l$ indexed by $J$.

(C0) Identifiability condition:

1) The index set $I$ satisfies $\min_{i \in I} |\beta_i| > c\, n^{(c_1 - 1)/2}$ and

$$\min_{l \neq 0,\ \|l_{\bar I}\|_{\ell_1} \le \|l_I\|_{\ell_1} + 2 c_2 n^{(c_1 - 1)/2}} \frac{\|\mathbf{X} l\|_{\ell_2}}{\sqrt{n}\, \|l_I\|_{\ell_2}} > \sqrt{3/8} \quad \text{(a.s.)}, \tag{3.1}$$

where the constants satisfy $0 < c_1 < 1$, $c_2 > 0$, $c = 4kbq\sigma + 4\sqrt{k^2 b^2 q^2 \sigma^2 + 3k c_2 b q \sigma / 8}$, $b > \sqrt{2}$, $q = |I|$ and $k > 0$.

2) The complement $\bar I$ satisfies $\|\beta_{\bar I}\|_{\ell_1} = c_2 n^{(c_1 - 1)/2}$ and $\max_{i \in \bar I} |\beta_i| = o(n^{(c_1 - 1)/2})$.

Part 1) of condition (C0) means that the coefficients in the selected set $I$ are significant, and the inequality (3.1) controls the restricted eigenvalues. Such an inequality is similar to the assumption in Bickel et al. (2009). Part 2) expresses non-sparsity in the following sense: the coefficients associated with insignificant predictors may not be exactly zero but decay to zero at the rate of $n^{(c_1 - 1)/2}$ as the sample size $n$ goes to infinity. We can easily construct non-sparse models satisfying condition (C0). Under this non-sparsity condition, all significant regression coefficients are contained in the selected set $I$ in an asymptotic sense, and therefore model identifiability is achieved when we select a working sub-model; for details see the following model selection principle and lemma.

With condition (C0), we could select a set of indices as

$$\hat I_{r_n} = \{1 \le j \le p : |\hat\beta_j^D| > r_n\},$$

where $r_n$ is a predefined threshold value chosen so that the obtained sub-model (2.4) is non-random with probability approaching one; the following lemma presents the details.

Lemma 3.1 In addition to Condition (C0), assume that $\sqrt{\log p} = k n^{c_1/2}$ with

0 < k < min{ 32X Mfc3~ 2 c 2 2 )VV ' and $c_3 > 2c_2$, all the diagonal elements of the matrix $\mathbf{X}^T\mathbf{X}/n$ are equal to 1, A = bay and r n = |n^ Cl_1 ^ 2 . Then, as $n \to \infty$,

$$P(\hat I_{r_n} = I) \to 1.$$



The proof of the lemma is given in the Appendix. The condition on $\mathbf{X}^T\mathbf{X}$ is used only to simplify the proof. This lemma guarantees that, even if the full model is non-sparse, the selected model equals the model with all significant predictors with probability tending to 1, i.e., the model selection is asymptotically exact and, therefore, the sub-model (2.4) can be regarded as a non-random model.

3.2 Re-modeling. Re-modeling for bias correction is necessary for the selected sub-model (2.4) when we want a valid model and a consistent estimator of the sub-vector $\theta = (\theta_1, \ldots, \theta_q)^T$. To this end, a new model with an instrumental variable is established in this subsection. Suppose that the $q$ significant predictors can be selected with probability going to one, which will be proved later. Denote $Z^* = (Z^T, U^{(1)}, \ldots, U^{(d)})^T$ and $W = AZ^*$, where $A$ is an $r \times (q + d)$ matrix whose row vectors have length 1. Here $U^{(1)}, \ldots, U^{(d)}$ are pseudo-variables (or instrumental variables) and, without loss of generality, they are supposed to be the first $d$ components of $U$. It will be seen that we usually choose $d = 1$. Set $V = (\alpha^T U, W^T)^T$, where $\alpha$ is a vector to be chosen later. Choose $A$ and $U^{(1)}, \ldots, U^{(d)}$ such that

$$E\{(Z - E(Z|V))(Z - E(Z|V))^T\} > 0. \tag{3.2}$$

This condition can trivially hold because $V$ contains $W$, which is a weighted sum of $Z$ and $U^{(1)}, \ldots, U^{(d)}$. Condition (3.2) guarantees the identifiability of the following model. The choice of $\alpha$, $A$ and $U^{(1)}, \ldots, U^{(d)}$ will be discussed later.

Denote $g(V) = E(\eta|V)$. Now we introduce a bias-corrected version of (2.4) as

$$Y_i = \theta^T Z_i + g(V_i) + \xi(V_i), \quad i = 1, \ldots, n, \tag{3.3}$$

where $\xi(V) = \eta - g(V)$. Obviously, if $\alpha$ in $V$ is identical to $\gamma$ in $\eta$, this model is unbiased, i.e., $E(\xi|Z, V) = 0$; otherwise it may be biased. This model can be regarded as a partially linear model with a linear component $\theta^T Z$ and a nonparametric component $g(V)$, and it is identifiable because of condition (3.2). From this structure, we can see that when $V$ does not contain the instrumental variable $W$ and $\alpha = \gamma$, the model reduces to the original working model (2.4), as $\xi$ is zero and $g(V)$ becomes the error term $\eta$ (if $\varepsilon$ is ignored). This observation motivates the following method: by introducing an instrumental variable $V$ so that $\xi$ has zero conditional mean, we can estimate $g(\cdot)$ and thereby correct the bias in the original working model. Although a nonparametric function $g(v)$ is involved, it will be verified that the dimension $r + 1$ of the variable $v$ is usually low. For the case of large $r$, we will introduce an approximate method to deal with the problem. Note that for $V$, the key is to properly select $\alpha$ and $W$. From the above description, we can see that although $\alpha = \gamma$ would be a natural and good choice, $\gamma$ is unknown and cannot be estimated consistently when the dimension is large. Taking this into account, we first consider a general $\alpha$ and construct a bias-corrected model with a suitable $W$, or equivalently a suitable matrix $A$.

To this end, we need the condition that $(Z, U)$ is elliptically symmetrically distributed. The ellipticity condition can be slightly weakened to the following linearity condition:

$$E(U \mid C^T Z^*) = E(U) + \Sigma_{U,Z^*} C (C^T \Sigma_{Z^*,Z^*} C)^{-1} C^T (Z^* - E(Z^*))$$

for some given matrix $C$. The linearity condition has been widely assumed for high-dimensional models. Hall and Li (1993) showed that it often holds approximately when the dimension $p$ is high.

With the above condition, we can find a matrix $A$ so that the model (3.3) is always unbiased. Let $\Sigma_{Z^*,Z^*} = \mathrm{Cov}(Z^*, Z^*)$ and $\Sigma_{U,Z^*} = \mathrm{Cov}(U, Z^*)$. Denote by $r$ the rank of the matrix $\Sigma_{U,Z^*}$. Obviously, $r$ is bounded if $q$ is fixed, because in this case the dimension of $Z^*$ is bounded. By the singular value decomposition,

$$\Sigma_{U,Z^*} = P \begin{pmatrix} \Lambda_r & 0 \\ 0 & 0 \end{pmatrix} Q^T,$$

where $P$ is a $(p - d) \times (p - d)$ orthogonal matrix, $Q$ is a $d \times d$ orthogonal matrix and

$$\Lambda_r = \mathrm{diag}(\eta_1, \ldots, \eta_r)$$

with $\eta_j > 0$ and $\eta_j^2$ being the positive eigenvalues of $\Sigma_{U,Z^*}^T \Sigma_{U,Z^*}$. Let $Q = (Q_1, Q_2)$, where $Q_1$ is a $d \times r$ matrix with orthonormal columns. In this case, we have the following conclusion.

Lemma 3.2 Under the above linearity condition, when $\Sigma_{Z^*,Z^*} = I_{q+d}$ and

$$A = Q_1^T, \tag{3.4}$$

the model (3.3) is then unbiased, that is, $E(\xi|Z, V) = 0$.

The condition $\Sigma_{Z^*,Z^*} = I_{q+d}$ is common because the components of $Z^*$ that are selected from $X$ form a low-dimensional vector. The proof of the lemma is presented in the Appendix. This lemma ensures that, with such a choice of $A$, the model (3.3) is always unbiased whether the model (2.1) is sparse or not.

The covariance matrix $\Sigma_{U,Z^*}$ is not always given and then needs to be estimated. Methods for constructing consistent estimators of large covariance matrices have been proposed in the literature, for example the tapering estimators investigated by Cai, Zhang and Zhou (2010). Let $\hat\Sigma_{U,Z^*}$ be a consistent estimator of $\Sigma_{U,Z^*}$ satisfying

$$\|\hat\Sigma_{U,Z^*} - \Sigma_{U,Z^*}\| = O_p(n^{-\varrho}), \tag{3.5}$$

where the constant $\varrho > 0$ and $\|\cdot\|$ is a matrix norm. By the singular value decomposition mentioned above, we get an estimator $\hat Q_1$ of $Q_1$. Then $\hat A = \hat Q_1^T$ is a consistent estimator of $A$, satisfying

$$\|\hat A - A\| = O_p(n^{-\varrho}).$$

From the above choice of $A$, we can see that $g(v)$ is an $(r + 1)$-variate nonparametric function. To realize the estimation procedure and reduce the dimension of the variable $v$, we choose a threshold $v_n > 0$ and then set $\hat\varphi_j = 0$ if $\hat\varphi_j < v_n$. Suppose that $\hat\varphi_1 \ge \cdots \ge \hat\varphi_{r^*} \ge v_n$ and the corresponding orthogonal matrix is $\hat Q_1^*$, where $r^* \le r$ and $\hat Q_1^*$ is a $(q + d) \times r^*$ matrix. In this case, the estimator of $A$ is $\hat A = (\hat Q_1^*)^T$ and, as a result, $g(v)$ is an $(r^* + 1)$-variate nonparametric function, in which the dimension of the variable is lower than or equal to the original one. Usually we choose $d = 1$ and, similar to the irrepresentable condition (Zhao and Yu 2006), we may assume that the rank of the covariance matrix of $(Z, U)$ is low (equivalently, the correlation between $Z$ and $U$ is weak). In this case $g(v)$ can be a low-dimensional nonparametric function. If $r^*$ is still large, we use a row vector to replace $A$ and will give a method in Section 4 to find an approximate solution with which $g(v)$ is a bivariate nonparametric function.
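
The construction of $A$ from the singular value decomposition, including the thresholding of small singular values, can be sketched as follows. This is an illustrative sketch only: `instrument_matrix` is a hypothetical name, `sigma_uz` stands for $\Sigma_{U,Z^*}$ or an estimate of it, and the argument `v_n` plays the role of the threshold $v_n$.

```python
import numpy as np

def instrument_matrix(sigma_uz, v_n=1e-8):
    """Return A = Q1^T and the singular values that were kept.

    The rows of A are the right singular vectors of sigma_uz whose singular
    values are at least v_n, so every row of A has length 1.
    """
    _, sing_vals, q_transposed = np.linalg.svd(sigma_uz, full_matrices=False)
    keep = sing_vals >= v_n
    return q_transposed[keep], sing_vals[keep]

# Rank-one example: the excluded predictors are correlated with Z* only
# through a single direction b (illustrative numbers).
a = np.array([1.0, 2.0, 0.5, -1.0, 3.0])
b = np.array([2.0, 1.0, 2.0])
A, kept = instrument_matrix(np.outer(a, b))
```

In the rank-one example $A$ has a single row proportional to `b`, so $W = AZ^*$ is one-dimensional and $g$ has only two arguments.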

The above deduction and justification show that the bias-correction procedure is free of the choice of $\alpha$. However, choosing a proper $\alpha$ is important. An ideal choice of $\alpha$ should be as close to $\gamma$ as possible. In the estimation procedure, a natural choice is the estimator $\hat\gamma^D$ of $\gamma$ obtained in the step that uses the Dantzig selector. We will discuss the asymptotic properties of the estimator of $\theta$ in Subsection 3.5, both for the case where $\alpha$ is given and for the case where it is estimated.

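3.3 Estimation. Recall that the bias-corrected model (3.3) can be thought of as a partially linear model. We therefore design an estimation procedure as follows. As mentioned above, the model (3.3) is unbiased for any $\alpha$, so the estimation procedure can be designed once $\alpha$ has been determined by any empirical method. Given $\theta$ and for any $\alpha$, if $A$ is estimated by $\hat A$, then the nonparametric function $g(v)$ is estimated by

$$\hat g_\theta(v) = \frac{\sum_{k=1}^n (Y_k - \theta^T Z_k) L_H(\hat V_k - v)}{\sum_{k=1}^n L_H(\hat V_k - v)},$$

where $\hat V = (\alpha^T U, \hat W^T)^T$ with $\hat W = \hat A Z^*$, and $L_H(\cdot)$ is an $(r + 1)$-dimensional kernel function. A simple choice of $L_H(\cdot)$ is the product kernel

$$L_H(V - v) = \frac{1}{h^{r+1}} K\Big(\frac{V^{(1)} - v^{(1)}}{h}\Big) \cdots K\Big(\frac{V^{(r+1)} - v^{(r+1)}}{h}\Big),$$

where $V^{(j)}$, $j = 1, \ldots, r + 1$, are the components of $V$, $K(\cdot)$ is a one-dimensional kernel function and $h$ is the bandwidth depending on $n$. In particular, when $\alpha$ is chosen as $\hat\gamma^D$, we get an estimator of $g(v)$ as

$$\tilde g_\theta(v) = \frac{\sum_{k=1}^n (Y_k - \theta^T Z_k) L_H(\tilde V_k - v)}{\sum_{k=1}^n L_H(\tilde V_k - v)},$$

where $\tilde V = (U^T \hat\gamma^D, \hat W^T)^T$.

With the two estimators of $g(v)$, the bias-corrected model (3.3) can be approximately expressed by the following two models:

$$Y_i \approx \theta^T Z_i + \hat g_\theta(\hat V_i) + \xi(\hat V_i) \quad \text{and} \quad Y_i \approx \theta^T Z_i + \tilde g_\theta(\tilde V_i) + \xi(\tilde V_i),$$

equivalently,

$$\hat Y_i \approx \theta^T \hat Z_i + \xi(\hat V_i) \quad \text{and} \quad \tilde Y_i \approx \theta^T \tilde Z_i + \xi(\tilde V_i), \tag{3.6}$$

where

$$\hat Y_i = Y_i - \frac{\sum_{k=1}^n Y_k L_H(\hat V_k - \hat V_i)}{\sum_{k=1}^n L_H(\hat V_k - \hat V_i)}, \qquad \hat Z_i = Z_i - \frac{\sum_{k=1}^n Z_k L_H(\hat V_k - \hat V_i)}{\sum_{k=1}^n L_H(\hat V_k - \hat V_i)},$$

and $\tilde Y_i$ and $\tilde Z_i$ are defined in the same way with $\hat V$ replaced by $\tilde V$. Thus, the sub-models in (3.6) result in two estimators of $\theta$:

$$\hat\theta = S_n^{-1} \frac{1}{n} \sum_{i=1}^n \hat Z_i \hat Y_i \quad \text{and} \quad \tilde\theta = \tilde S_n^{-1} \frac{1}{n} \sum_{i=1}^n \tilde Z_i \tilde Y_i, \tag{3.7}$$

where $S_n = \frac{1}{n} \sum_{i=1}^n \hat Z_i \hat Z_i^T$ and $\tilde S_n = \frac{1}{n} \sum_{i=1}^n \tilde Z_i \tilde Z_i^T$, respectively. Here we assume that the bias-corrected model (3.3) is homoscedastic, that is, $\mathrm{Var}(\xi(V_i)) = \sigma_V^2$ and $\mathrm{Var}(\xi(\tilde V_i)) = \sigma_{\tilde V}^2$ for all $i = 1, \ldots, n$. If the model is heteroscedastic, we modify the above estimators, assuming that $\sigma^2(V_i)$ and $\sigma^2(\tilde V_i)$ are known, as

$$\hat\theta^* = (S_n^*)^{-1} \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2(V_i)} \hat Z_i \hat Y_i \quad \text{and} \quad \tilde\theta^* = (\tilde S_n^*)^{-1} \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2(\tilde V_i)} \tilde Z_i \tilde Y_i,$$

where $S_n^* = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2(V_i)} \hat Z_i \hat Z_i^T$ and $\tilde S_n^* = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2(\tilde V_i)} \tilde Z_i \tilde Z_i^T$, respectively, and $\sigma^2(V_i) = \mathrm{Var}(\xi(V_i))$ and $\sigma^2(\tilde V_i) = \mathrm{Var}(\xi(\tilde V_i))$. When $\sigma^2(V_i)$ and $\sigma^2(\tilde V_i)$ are unknown, we can replace them by consistent estimators; for details about how to estimate them see, for example, Hardle et al. (2000). In the following we only consider the estimators defined in (3.7). Finally, an estimator of $g(v)$ can be defined as either $\hat g_{\hat\theta}(v)$ or $\tilde g_{\tilde\theta}(v)$.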
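
A minimal Python sketch of the kernel estimate of $g$ with a product kernel is given below. It assumes an Epanechnikov kernel for $K$, which is one possible choice and is not prescribed here; the function names are hypothetical, and `residual` stands for the values $Y_k - \theta^T Z_k$.

```python
import numpy as np

def epanechnikov(u):
    """K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def product_kernel(V, v, h):
    """Product-kernel weight L_H(V_k - v) for every row V_k of the array V."""
    dim = V.shape[1]
    return epanechnikov((V - v) / h).prod(axis=1) / h**dim

def nw_estimate(V, residual, v, h):
    """Kernel-weighted average of `residual` around the point v."""
    weights = product_kernel(V, v, h)
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no observations within one bandwidth of v")
    return weights @ residual / total
```

Evaluating `nw_estimate` at each observed $V_i$ with the responses, and then with each column of $Z$, gives the smoothed quantities that are subtracted in (3.6).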






3.4 Algorithm. In summary, our algorithm consists of the following three steps:

Step 1. Choose an initial value of $\alpha$, which may be arbitrary or estimated.

Step 2. Compute the singular value decomposition of the matrix $\Sigma_{U,Z^*}$ and then choose $A = Q_1^T$, or $\hat A = \hat Q_1^T$ with $\hat Q_1$ an estimator of $Q_1$ if $\Sigma_{U,Z^*}$ is unknown.

Step 3. Construct the estimators by (3.7).

The procedure shows that the new algorithm is slightly more complicated to implement than existing methods, because an estimation procedure for a linear model is replaced by one for a nonlinear model. However, it yields consistent estimation and improves prediction accuracy for non-sparse models, so the additional computational cost is worthwhile.
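
The three steps can be put together in a short Python sketch. It is a simplified illustration under explicit assumptions and not the authors' code: the function names are hypothetical, the sample covariance is used for $\hat\Sigma_{U,Z^*}$, a fixed number `rank` of leading singular vectors stands in for the thresholding rule, the Epanechnikov kernel is assumed, and the components of $V$ are standardized so that a single bandwidth `h` can be used.

```python
import numpy as np

def epanechnikov(u):
    """K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def smoother_matrix(V, h):
    """Row-normalised product-kernel weights; row i smooths at the point V_i."""
    diff = (V[:, None, :] - V[None, :, :]) / h
    weights = epanechnikov(diff).prod(axis=2)
    return weights / weights.sum(axis=1, keepdims=True)

def remodel_estimate(Z, U, y, alpha, d=1, rank=1, h=0.4):
    """Sketch of Steps 1-3 for a given alpha (Step 1 is left to the caller)."""
    n = len(y)
    z_star = np.hstack([Z, U[:, :d]])              # Z* = (Z, U^(1), ..., U^(d))
    # Step 2: sample Cov(U, Z*) and its leading right singular vectors.
    u_centered = U - U.mean(axis=0)
    zs_centered = z_star - z_star.mean(axis=0)
    sigma_uz = u_centered.T @ zs_centered / n
    _, _, q_transposed = np.linalg.svd(sigma_uz, full_matrices=False)
    A = q_transposed[:rank]                        # rank x (q + d), unit rows
    V = np.column_stack([U @ alpha, z_star @ A.T]) # V = (alpha^T U, W^T)^T
    V = (V - V.mean(axis=0)) / V.std(axis=0)       # assumption: common scale
    # Step 3: subtract the kernel-smoothed parts, then least squares as in (3.7).
    S = smoother_matrix(V, h)
    y_res = y - S @ y
    z_res = Z - S @ Z
    theta, *_ = np.linalg.lstsq(z_res, y_res, rcond=None)
    return theta, A
```

A pilot estimate of $\gamma$, for instance from a Dantzig selector fit computed elsewhere, could be passed as `alpha`; the bandwidth and the number of retained directions would need to be tuned for real data.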

3.5 Asymptotic normality. To study the asymptotic behavior, the following conditions for the model (3.3) are assumed:

(C1) The first two derivatives of $g(v)$ and $\xi(v)$ are continuous.

(C2) The kernel function $K(\cdot)$ satisfies

$$\int K(u)\,du = 1, \quad \int u^j K(u)\,du = 0, \ j = 1, \ldots, k - 1, \quad 0 < \int u^k K(u)\,du < \infty.$$

(C3) The bandwidth $h$ is optimally chosen, i.e., $h = O(n^{-1/(2k + r + 1)})$.

(C4) The constant $\varrho$ in (3.5) satisfies $\varrho > 1/4$.
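
Conditions (C2) and (C3) are easy to check numerically for a given kernel. The sketch below assumes the Epanechnikov kernel, a second-order kernel ($k = 2$) chosen here only as an example, and a hypothetical constant `c` in the bandwidth rate; all function names are illustrative.

```python
import numpy as np

def epanechnikov(u):
    """K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_moment(kernel, j, grid_size=200001):
    """Trapezoidal approximation of the j-th moment of a kernel supported on [-1, 1]."""
    u = np.linspace(-1.0, 1.0, grid_size)
    values = kernel(u) * u**j
    step = u[1] - u[0]
    return step * (values.sum() - 0.5 * (values[0] + values[-1]))

def bandwidth(n, k, r, c=1.0):
    """Bandwidth of order n^{-1/(2k + r + 1)}, the rate stated in (C3)."""
    return c * n ** (-1.0 / (2 * k + r + 1))
```

For this kernel the zeroth moment is 1, the first moment vanishes and the second moment is positive and finite, which is the pattern required by (C2) with $k = 2$.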

Conditions (C1)-(C3) are commonly used for semiparametric models. Condition (C4) is satisfied by consistent covariance estimators, for example the tapering estimators investigated by Cai, Zhang and Zhou (2010). With these conditions, the following theorem states the asymptotic normality of the bias-corrected estimator $\hat\theta$.

Theorem 3.3 In addition to the conditions in Lemma 3.1, assume that conditions (C1)-(C4) and (3.2) hold. For a given nonzero vector $\alpha$, if $q$ is fixed and $p$ may be larger than $n$, then, as $n \to \infty$,

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{D} N(0, \sigma_V^2 S^{-1}),$$

where $S = E\{(Z - E(Z|V))(Z - E(Z|V))^T\}$.

The proof of the theorem is postponed to the Appendix.

Remark 3.1. This theorem shows that the new estimator $\hat\theta$ is $\sqrt{n}$-consistent regardless of the choice of the shrinkage tuning parameter $\lambda_p$, and thus it is convenient to use in practice. Furthermore, by the theorem and commonly used nonparametric techniques, we can prove that $\hat g_{\hat\theta}(v)$ is also consistent. In fact, we can obtain strong consistency and consistency of the mean squared error under some stronger conditions. The details are omitted in this paper. Note that these results also hold when the model is sparse. Thus, for either sparse or non-sparse models, our method always ensures estimation consistency for the coefficients selected into the working model.

To investigate the asymptotic properties of the second estimator $\tilde\theta$ in (3.7), which is based on the Dantzig selector $\hat\gamma^D$, we need the following additional conditions:

(C5) The maximum eigenvalue $\lambda_m$ of $UU^T$ is bounded for all $n$.

(C6) Suppose that there exists a nonzero vector, say $\alpha$, such that $\|\hat\gamma^D - \alpha\|_{\ell_2} = O_p(n^{-\mu})$ for some $\mu$ satisfying $\mu > 1/4$.

Condition (C5) is commonly used for high-dimensional models (see, e.g., Fan and Peng 2004). Condition (C6) requires some explanation. In the previous sections, $\alpha$ denoted an arbitrary vector. The vector $\alpha$ in condition (C6) is different: here $\alpha$ is a fixed vector. For simplicity of presentation we still use the same notation $\alpha$ in both places. Condition (C6) is the key to the following theorem. This condition does not mean that the Dantzig selector $\hat\gamma^D$ is consistent. It only implies that, when $n$ is large enough, $\hat\gamma^D$ is close to a non-random vector $\alpha$. Note that the accuracy of the solution of linear programming can guarantee that $\|\hat\gamma^D - \alpha\|_{\ell_2}$ is small enough for a solution of the linear programming problem (2.2) (see, for example, Malgouyres and Zeng 2009). This shows that condition (C6) is reasonable. Condition (C6) can actually be weakened, but for simplicity of the technical proof and presentation, we use the current conditions in this paper.

Theorem 3.4 Under conditions (C1)-(C6) and the conditions in Lemma 3.1,
when q is fixed and p may be larger than n, we have

$\sqrt{n}(\hat\theta - \theta) \xrightarrow{D} N(0, \sigma_V^2 S^{-1})$.

The proof of the theorem is given in the Appendix.

Remark 3.2. This theorem shows that when $\gamma$ is replaced by the Dantzig selector
$\hat\gamma^D$, the resulting estimator $\hat\theta$ is also $\sqrt{n}$-consistent regardless of the choice of the
shrinkage tuning parameter $\lambda_p$. On the other hand, although Theorems 3.3 and
3.4 have an identical representation for the asymptotic covariances, the asymptotic
covariances of the two estimators are in fact different because the vector $\alpha$, and
therefore V, used in the two theorems are different.

3.6 Prediction. Combining the estimation consistency with the unbiasedness of
the adjusted sub-model (3.3), we obtain an improved prediction as

$\hat Y = \hat\theta^T Z + \hat g_{\hat\theta}(V)$    (3.8)
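and the corresponding prediction error is

$$
\begin{aligned}
E(Y - \hat Y)^2 &= E((\hat\theta - \theta)^T Z)^2 + E(\hat g_{\hat\theta}(V) - g(V))^2 + E(\xi^2(V)) \\
&\quad + 2E((\hat\theta - \theta)^T Z(\hat g_{\hat\theta}(V) - g(V))) + 2E((\hat\theta - \theta)^T Z\xi(V)) \\
&\quad + 2E((\hat g_{\hat\theta}(V) - g(V))\xi(V)) \\
&= E(\xi^2(V)) + o(1).
\end{aligned}
$$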

This prediction has a smaller prediction error than the one obtained by the classical
Dantzig selector and, interestingly, no high-dimensional nonparametric estimation is
needed.
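The adjusted prediction (3.8) can be sketched in a few lines. The block below is only an illustration under stated assumptions, not the authors' implementation: it treats V as a scalar index for simplicity, estimates g by a Nadaraya-Watson smoother of the residuals $Y - \hat\theta^T Z$ with a Gaussian kernel, and takes the estimate `theta_hat` and the bandwidth `h` as given. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_smooth(v_train, resid_train, v_new, h):
    """Nadaraya-Watson estimate of E(residual | V = v) at each point of v_new.

    Assumption: V is one-dimensional; h is a user-chosen bandwidth.
    """
    v_train = np.asarray(v_train, dtype=float)
    v_new = np.asarray(v_new, dtype=float)
    # Pairwise scaled distances between new and training index values.
    u = (v_new[:, None] - v_train[None, :]) / h
    w = gaussian_kernel(u)
    return (w @ np.asarray(resid_train, dtype=float)) / w.sum(axis=1)

def adjusted_predict(theta_hat, Z_train, y_train, v_train, Z_new, v_new, h):
    """Parametric part theta_hat^T Z plus the smoothed residual adjustment."""
    resid = y_train - Z_train @ theta_hat
    return Z_new @ theta_hat + nw_smooth(v_train, resid, v_new, h)
```

When the residuals carry no structure in V, the smoother returns roughly their average, so the adjustment reduces to an intercept correction.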

In contrast, when we use the new estimator $\hat\theta$ and the sub-model (2.4), rather than
the adjusted sub-model (3.3), the resulting prediction is defined as

$\hat Y_S = \hat\theta^T Z + \hat g_S$,    (3.9)

where

$\hat g_S = \frac{1}{n}\sum_{i=1}^{n} \hat g_{\hat\theta}(V_i)$.

We add $\hat g_S$ in (3.9) for prediction because the sub-model (2.4) has a bias E(g(V));
without it, the prediction error would be even larger. In this case, $\hat g_S$ is free of the
predictor U, and the resulting prediction (3.9) uses only the predictor Z in the
sub-model (2.4). This differs from the prediction (3.8), which depends on both
the low-dimensional predictor Z and the high-dimensional predictor U. Thus (3.9) is a
sub-model based prediction. The corresponding prediction error is

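$$
\begin{aligned}
E(Y - \hat Y_S)^2 &= E((\hat\theta - \theta)^T Z)^2 + E(\hat g_S - g(V))^2 + E(\xi^2(V)) \\
&\quad + 2E((\hat\theta - \theta)^T Z(\hat g_S - g(V))) + 2E((\hat\theta - \theta)^T Z\xi(V)) \\
&\quad + 2E((\hat g_S - g(V))\xi(V)) \\
&= E(\xi^2(V)) + \mathrm{Var}(g(V)) + 2E((E(g(V)) - g(V))\xi(V)) + o(1).
\end{aligned}
$$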

This error is usually larger than that of the prediction (3.8). However, we can see
that

$|E((E(g(V)) - g(V))\xi(V))| \le (\mathrm{Var}(g(V))\,\mathrm{Var}(\xi(V)))^{1/2}$

and usually the values of both Var(g(V)) and Var($\xi$(V)) are small. Then such a
prediction still has a smaller prediction error than the one obtained by the sub-model
(2.4) and the common LS estimator $\tilde\theta_S = (Z^T Z)^{-1} Z^T Y$:

$\tilde Y_S = \tilde\theta_S^T Z$.    (3.10)



Precisely, the corresponding error of $\tilde Y_S$ in (3.10) is

$E(Y - \tilde Y_S)^2 = E((\tilde\theta_S - \theta)^T Z)^2 + E(\gamma^T U)^2 + \sigma^2 + 2E((\tilde\theta_S - \theta)^T Z\gamma^T U)$.



Because $\tilde\theta_S$ does not converge to $\theta$, the values of both $E((\tilde\theta_S - \theta)^T Z)^2$ and
$2E((\tilde\theta_S - \theta)^T Z\gamma^T U)$ are large, and as a result the prediction error is large as well.
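The failure of the sub-model LS estimator to converge to the true coefficient can be made explicit with the standard omitted-variable calculation. The display below is a sketch under assumptions not restated in this section: the full model is written as $Y = \theta^T Z + \gamma^T U + \varepsilon$ with $E(\varepsilon | Z, U) = 0$, q is fixed, and $E(ZZ^T)$ is invertible with finite second moments.

```latex
\tilde\theta_S
  = \Big(\frac{1}{n}\sum_{i=1}^{n} Z_i Z_i^T\Big)^{-1}
    \frac{1}{n}\sum_{i=1}^{n} Z_i Y_i
  \;\xrightarrow{P}\;
  \{E(ZZ^T)\}^{-1} E(ZY)
  = \theta + \{E(ZZ^T)\}^{-1} E(ZU^T)\,\gamma .
```

Under these assumptions the limit equals $\theta$ only when $E(ZU^T)\gamma = 0$, which typically fails in a non-sparse model with correlated predictors.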

The above results show that, in terms of prediction, the new estimator can
reduce the prediction error under both the adjusted sub-model (3.3) and the original
sub-model (2.4). We will see that the simulation results in Section 5 agree with
these conclusions.

4. Calculation for A in the case of large r 

For convenience of presentation, we suppose here that E(Z) = 0, E(U) = 0 and
Cov(Z*) = I. Lemma A2 given in the Appendix shows that the model (3.3) is unbiased
if A is a solution of the following equation:

$\Sigma_{U,Z^*} A^T (AA^T)^{-1} A Z^* = \Sigma_{U,Z^*} Z^*$.    (4.1)

As mentioned before, when r is large, an (r + 1)-dimensional nonparametric
estimation is involved, which may lead to inefficient estimation. Thus, we
suggest an approximate solution of (4.1) in which A is a row vector, that is, r = 1.
Without confusion, we still use the notation A to denote this row vector. That is,
we choose a row vector A such that

$A^T A Z^* = \Sigma^{+}_{U,Z^*}\Sigma_{U,Z^*} Z^*$.    (4.2)
By (4.2), an estimator of A can be constructed as follows. Denote $A = (a_1, \cdots, a_q, a_{q+1})$,
$A_k = a_k A$ and $\Sigma^{+}_{U,Z^*}\Sigma_{U,Z^*} = (D_1^T, \cdots, D_q^T, D_{q+1}^T)^T$, where $D_k$, $k = 1, \cdots, q+1$, are
(q + 1)-dimensional row vectors. Then we estimate A via solving the following
optimization problem:

$\inf\Big\{Q(a_1, \cdots, a_{q+1}) : \sum_{k=1}^{q+1} a_k^2 = 1\Big\}$,    (4.3)

where $Q(a_1, \cdots, a_{q+1}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q+1}\|(A_k - D_k)Z_i^*\|^2$. By the Lagrange multiplier,
we obtain the estimators of $A_k$, $k = 1, \cdots, q+1$, as

$\hat A_k = \Big(D_k \frac{1}{n}\sum_{i=1}^{n} Z_i^* Z_i^{*T} + c\,c_k e_k/2\Big)\Big(\frac{1}{n}\sum_{i=1}^{n} Z_i^* Z_i^{*T} + c_k I\Big)^{-1}$,    (4.4)

where $c_k > 0$, which is similar to a ridge parameter, depends on n and tends to zero
as $n \to \infty$, and $e_k$ is the row vector with k-th component being 1 and the others
being zero. Note that the constraint $\|A\| = 1$ implies $\|A_k\| = \pm a_k$. By combining
(4.4) with this constraint we get an estimator of $a_k$ as

$\hat a_k = \pm\|\hat A_k\|$

and consequently an estimator of A is obtained by

$\hat A = (\hat a_1, \cdots, \hat a_q, \hat a_{q+1})$.



5. Simulation studies 

In this section we examine the performance of the new method via simulation
studies. Using the mean squared error (MSE), the model prediction error (PE) and their
standard errors (stdMSE and stdPE), we first compare the method with the Gaussian
Dantzig selector. In ultra-high dimensional scenarios the Dantzig selector cannot work
well, so we use sure independence screening (SIS) (Fan and Lv 2008) to bring the
dimension down to a moderate size and then make a comparison with the Gaussian
Dantzig selector. As is well known, several factors have great impact on the
performance of variable selection methods: sparse or non-sparse conditions, the
dimension p of the predictor X, the correlation structure between the components of
X, and the variation of the error, which can be measured by the theoretical model
R-square defined by $R^2 = (\mathrm{Var}(Y) - \sigma^2)/\mathrm{Var}(Y)$. We illustrate the theoretical
conclusions and the performance of the methods across these factors.
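The theoretical R-square can be evaluated directly from the design once the coefficient vector and the covariance of X are fixed. The sketch below assumes the error is independent of X, so that $\mathrm{Var}(Y) = \beta^T \Sigma \beta + \sigma^2$; the function names are hypothetical and are not taken from the paper's Matlab package.

```python
import numpy as np

def theoretical_r2(beta, Sigma, sigma):
    """R^2 = (Var(Y) - sigma^2) / Var(Y) for Y = beta^T X + eps.

    Assumption: eps is independent of X, so Var(Y) = beta^T Sigma beta + sigma^2.
    """
    beta = np.asarray(beta, dtype=float)
    signal = float(beta @ np.asarray(Sigma, dtype=float) @ beta)
    var_y = signal + sigma ** 2
    return (var_y - sigma ** 2) / var_y

def sigma_for_r2(beta, Sigma, r2):
    """Error standard deviation that yields a target theoretical R^2 in (0, 1)."""
    beta = np.asarray(beta, dtype=float)
    signal = float(beta @ np.asarray(Sigma, dtype=float) @ beta)
    return np.sqrt(signal * (1.0 - r2) / r2)
```

Inverting the formula in this way is one convenient route to choosing the error scale that produces a desired signal-to-noise level in a simulation design.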

Experiment 1. This experiment compares our method with the Gaussian Dantzig
selector under different choices of the theoretical model R-square $R^2$. To determine
the regression coefficients, we decompose the coefficient vector $\beta$ into two parts,
$\beta_I$ and $\beta_{-I}$, where $I$ denotes the set of locations of the significant components of
$\beta$. Three types of $\beta_I$ are considered:
Type (I): $\beta_I = (1, 0.4, 0.3, 0.5, 0.3, 0.3, 0.3)^T$ and $I = \{1, 2, 3, 4, 5, 6, 7\}$;
Type (II): $\beta_I = (1, 0.4, 0.3, 0.5, 0.3, 0.3, 0.3)^T$ and $I = \{1, 17, 33, 49, 65, 81, 97\}$;
Type (III): $\beta_I = (1, 0.4, -0.3, -0.5, 0.3, 0.3, -0.3)^T$ and $I = \{1, 2, 3, 4, 5, 6, 7\}$.
To mimic practical scenarios, we set the values of the components $\beta_{-I,j}$ of $\beta_{-I}$ as
follows. Before performing the variable selection and estimation, we generate the
$\beta_{-I,j}$'s from the uniform distribution U(-0.5, 0.15) and then set the negative values
to zero. Thus the model under study here is non-sparse. After the coefficient
vector $\beta$ is determined, we treat it as fixed and regard $\beta_I$ as the main part of
$\beta$. We use $\hat I$ to denote the set of subscripts of the coefficients $\theta$ in $\beta$, that is,
the subscripts of the predictors selected into the sub-model. We assume
$X \sim N_p(\mu, \Sigma_X)$, where the components of $\mu$ corresponding to $I$ are 0 and the others
are 2, and the (i, j)-th element of $\Sigma_X$ satisfies $\Sigma_{ij} = (-\rho)^{|i-j|}$, $0 < \rho < 1$.
Furthermore, the error term $\varepsilon$ is assumed to be normally distributed as
$\varepsilon \sim N(0, \sigma^2)$. In this experiment, we choose different $\sigma$ to obtain different types
of full model with different $R^2$. In the simulation procedure, the kernel function
is chosen as the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}}\exp\{-u^2/2\}$, A is chosen by (4.4) with
c = 2 and $c_k = 0.2$, and the parameter $\lambda_p$ in the Dantzig selector is chosen as in
Candes and Tao (2007), namely as the empirical maximum of $|X^T z|_i$
over several realizations of $z \sim N(0, I_n)$.
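The data-generating design of Experiment 1 and the data-driven choice of the Dantzig tuning parameter can be sketched as follows. This is an illustration only, not the authors' Matlab code: the helper names are hypothetical, the mean vector is taken to be 0 on the significant set and 2 elsewhere, and the tuning parameter is read as the largest absolute entry of $X^T z$ over a number of Gaussian draws, with no column normalization of X.

```python
import numpy as np

def make_beta(p, beta_I, I, rng):
    """Non-sparse coefficient vector; I holds 1-based significant positions."""
    beta = rng.uniform(-0.5, 0.15, size=p)   # small background coefficients
    beta[beta < 0] = 0.0                     # negative draws are set to zero
    beta[np.asarray(I) - 1] = beta_I         # significant part beta_I
    return beta

def make_design(n, p, rho, I, rng):
    """X ~ N_p(mu, Sigma) with Sigma_ij = (-rho)^{|i-j|}."""
    idx = np.arange(p)
    Sigma = (-rho) ** np.abs(idx[:, None] - idx[None, :])
    mu = np.full(p, 2.0)                     # assumption: mean 2 outside I
    mu[np.asarray(I) - 1] = 0.0              # assumption: mean 0 on I
    X = rng.multivariate_normal(mu, Sigma, size=n)
    return X, Sigma

def simulate_response(X, beta, sigma, rng):
    """Y = X beta + eps with eps ~ N(0, sigma^2)."""
    return X @ beta + sigma * rng.standard_normal(X.shape[0])

def dantzig_lambda(X, n_draws, rng):
    """Empirical maximum of |X^T z| over n_draws realizations of z ~ N(0, I_n)."""
    z = rng.standard_normal((X.shape[0], n_draws))
    return float(np.abs(X.T @ z).max())
```

A typical call would draw `beta` once, keep it fixed across repetitions, and regenerate `X` and the response in each repetition.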

Tables 1 and 2 report the MSEs and the corresponding PEs over 200
repetitions. In these tables, $\hat Y$ is the prediction via the adjusted model (3.3), which is
based on the full dataset; $\hat Y_S$ is the prediction via the sub-model (2.4) with the new
estimator $\hat\theta$ defined in (3.7); and $\tilde Y_S$ is the prediction via the sub-model (2.4)
with the Gaussian Dantzig selector $\tilde\theta_S$. For the definitions of $\hat Y$, $\hat Y_S$ and $\tilde Y_S$ see (3.8),
(3.9) and (3.10), respectively. The purpose of this comparison is to see whether
the adjustment works and, when the high-dimensional data are not available (say,
too expensive to collect) so that the sub-model (2.4) must be used, whether the
new estimator $\hat\theta$ together with the sub-model (2.4) improves prediction accuracy.
The sample size is 50. For the prediction, we compute the proportion $\tau$ of the 200
repetitions in which the prediction error of $\hat Y_S$ is less than that of $\tilde Y_S$. The larger
$\tau$ is, the better the new prediction is.
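The summary quantities reported in the tables can be computed from per-repetition errors as sketched below. The helper names are hypothetical, and the spread is taken to be the sample standard deviation across repetitions, which is an assumption since the text does not spell out how stdMSE and stdPE are formed.

```python
import numpy as np

def summarize(errors):
    """Mean and sample standard deviation of per-repetition errors."""
    errors = np.asarray(errors, dtype=float)
    return float(errors.mean()), float(errors.std(ddof=1))

def win_proportion(pe_new, pe_ref):
    """Proportion of repetitions in which the new prediction error is smaller."""
    pe_new = np.asarray(pe_new, dtype=float)
    pe_ref = np.asarray(pe_ref, dtype=float)
    return float(np.mean(pe_new < pe_ref))
```

With 200 repetitions, a table entry such as 166/200 corresponds to a proportion of 0.83.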
Table 1. MSE, PE and their standard errors with n = 50, p = 100 and $\rho = 0.1$

type | $R^2$ | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
(I) | 0.98 | 0.0032(0.0118) | 0.0866(0.3519) | 0.1630(0.0405) | 0.2299(0.0535) | 1.1587(0.5549) | 200/200
(I) | 0.82 | 0.0134(0.0544) | 0.1197(0.1654) | 0.6603(0.1497) | 0.7249(0.1564) | 1.4755(0.3475) | 200/200
(I) | 0.67 | 0.0273(0.1288) | 0.0430(0.1283) | 1.3038(0.2952) | 1.3438(0.3018) | 1.4821(0.3266) | 166/200
(I) | 0.50 | 0.0543(0.2387) | 0.0694(0.2221) | 2.5371(0.5500) | 2.5919(0.5633) | 2.7176(0.6020) | 142/200
(I) | 0.31 | 0.1028(0.4689) | 0.1131(0.4876) | 4.9199(1.1856) | 4.9960(1.2070) | 5.0708(1.1965) | 126/200
(II) | 0.98 | 0.0052(0.0202) | 0.3540(1.4263) | 0.2584(0.0569) | 0.2744(0.0583) | 1.1324(2.4262) | 200/200
(II) | 0.84 | 0.0162(0.0686) | 0.4087(0.3730) | 0.8310(0.1823) | 0.8417(0.1834) | 3.7996(0.7909) | 200/200
(II) | 0.70 | 0.0292(0.1112) | 0.1770(0.2559) | 1.4761(0.3028) | 1.4727(0.3018) | 2.6389(0.5804) | 199/200
(II) | 0.53 | 0.0588(0.3024) | 0.0942(0.2988) | 2.8825(0.6534) | 2.8700(0.6460) | 3.2707(0.6758) | 171/200
(II) | 0.35 | 0.1107(0.6896) | 0.1251(0.6368) | 5.4055(1.1809) | 5.3896(1.1856) | 5.6004(1.2280) | 141/200
(III) | 0.98 | 0.0028(0.0113) | 0.0879(0.2938) | 0.1643(0.0410) | 0.2365(0.0537) | 1.2282(0.5590) | 200/200
(III) | 0.83 | 0.0114(0.0531) | 0.0873(0.1589) | 0.5874(0.1332) | 0.6938(0.1533) | 1.3483(0.3118) | 200/200
(III) | 0.69 | 0.0234(0.0934) | 0.1294(0.1667) | 1.1922(0.2857) | 1.2445(0.2961) | 1.9950(0.4379) | 196/200
(III) | 0.51 | 0.0529(0.1715) | 0.0913(0.1775) | 2.6373(0.5788) | 2.7418(0.6098) | 2.9601(0.6288) | 164/200
(III) | 0.33 | 0.1006(0.5013) | 0.1083(0.5158) | 5.0952(1.2099) | 5.1720(1.2241) | 5.2372(1.2594) | 119/200



The simulation results in Table 1 suggest that the adjustment of (3.3) works very
well: the corresponding estimation ($\hat\theta$) and prediction ($\hat Y$) are the best, or very close
to the best, among the competitors. Further, as we mentioned, when the full dataset is
not available and we thus use the sub-model (2.4), the new estimator $\hat\theta$ is also useful
for prediction. It can be seen that $\hat Y_S$ with $\hat\theta$ is better than $\tilde Y_S$ with the Gaussian
Dantzig selector $\tilde\theta_S$: the value of $\tau$ is larger than 0.7 in 13 of the 15 cases,
and in the other 2 cases it is about 0.6.

To provide more information, we also consider the case with higher correlation
between the components of X. Table 2 shows that when $\rho$ is larger, the conclusions
about the comparison are almost identical to those presented in Table 1. Thus,
whether $\rho$ is large or small, and for different choices of $R^2$, our new
method works quite well.
Table 2. MSE, PE and their standard errors with n = 50, p = 100 and $\rho = 0.7$

type | $R^2$ | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
(I) | 0.96 | 0.0136(0.0504) | 0.3285(0.4226) | 0.2472(0.0517) | 0.2706(0.0599) | 1.7397(0.3804) | 200/200
(I) | 0.71 | 0.0253(0.1426) | 0.0709(0.2401) | 0.6530(0.1463) | 0.6945(0.1557) | 1.9892(0.2070) | 197/200
(I) | 0.53 | 0.0373(0.1621) | 0.1108(0.2310) | 1.2779(0.2744) | 1.3235(0.2861) | 1.5985(0.3736) | 177/200
(I) | 0.35 | 0.0613(0.3122) | 0.0999(0.3289) | 2.3431(0.5342) | 2.3694(0.5395) | 2.6339(0.5799) | 161/200
(I) | 0.2 | 0.1198(0.6479) | 0.1292(0.6619) | 5.1184(1.2643) | 5.1347(1.2729) | 5.1764(1.2420) | 129/200
(II) | 0.98 | 0.0122(0.0484) | 0.2730(0.3789) | 0.2648(0.0730) | 0.2809(0.0757) | 1.1952(0.2440) | 200/200
(II) | 0.84 | 0.0201(0.0924) | 0.1799(0.2037) | 0.6567(0.1453) | 0.6580(0.1452) | 1.6477(0.3560) | 200/200
(II) | 0.69 | 0.0303(0.1338) | 0.2899(0.4442) | 1.2955(0.2992) | 1.2996(0.3047) | 2.7125(0.5861) | 200/200
(II) | 0.52 | 0.0644(0.3395) | 0.11411(0.4388) | 2.5572(0.5558) | 2.5633(0.5582) | 3.2790(0.6834) | 191/200
(II) | 0.34 | 0.1245(0.5615) | 0.1831(0.6787) | 5.0731(1.1850) | 5.0818(1.1743) | 5.5988(1.2782) | 161/200
(III) | 0.96 | 0.0239(0.0626) | 0.6020(2.1653) | 0.2596(0.0560) | 0.2897(0.0630) | 1.6754(1.4970) | 200/200
(III) | 0.74 | 0.0315(0.1158) | 0.4401(0.5248) | 0.6435(0.1435) | 0.6485(0.1442) | 2.7859(0.6035) | 200/200
(III) | 0.56 | 0.0749(0.2373) | 0.1736(0.2679) | 1.3334(0.2947) | 1.4367(0.3217) | 1.8643(0.3965) | 189/200
(III) | 0.38 | 0.0687(0.3227) | 0.1701(0.3809) | 2.3637(0.4538) | 2.4645(0.4818) | 2.9415(0.5992) | 178/200
(III) | 0.23 | 0.1740(0.8078) | 0.2446(0.8718) | 4.8488(1.1812) | 4.8887(1.1968) | 5.1471(1.1499) | 145/200



We now make another comparison. In Experiments 2 and 3 below, we do not use
the data-driven approach of Experiment 1 to select $\lambda_p$; instead, we manually
select several values to see whether our method works. This is because the goal
of these two experiments is not to study the shrinkage tuning parameter but to see
whether the new method works once a sub-model is given.

Experiment 2. This experiment compares our method with others under different
choices of the correlation between predictors and of the sub-model.
The distribution of X is the same as that in Experiment 1 except for the dimension
of the covariate. The coefficient vector $\beta_I$ is designed as type (I) above and $\beta_{-I}$ is
designed as in Experiment 1. Thus the model here is also non-sparse. Furthermore,
the error term $\varepsilon$ is assumed to be normally distributed as $\varepsilon \sim N(0, 0.2^2)$.

As different choices of $\lambda_p$ usually lead to different sub-models and, equivalently, to
different estimators $\hat\gamma$ of $\gamma$, we consider different choices of $\lambda_p$ in the simulation
study. The setting is as follows. For n = 50, p = 100 and $\rho$ = 0.1, 0.3, 0.5, 0.7, we
consider two cases for each $\rho$:

$\rho$ = 0.1:
Case 1. $\lambda_p$ = 3.97, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 2, 3, 4, 5, 6, 7}
Case 2. $\lambda_p$ = 6.53, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 3, 4, 6, 95}

$\rho$ = 0.3:
Case 1. $\lambda_p$ = 3.32, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 2, 3, 4, 5, 6}
Case 2. $\lambda_p$ = 6.77, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 2, 4, 6, 23}

$\rho$ = 0.5:
Case 1. $\lambda_p$ = 3.72, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 2, 4, 5, 6, 7}
Case 2. $\lambda_p$ = 7.29, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 4, 5, 7, 41, 58, 72}

$\rho$ = 0.7:
Case 1. $\lambda_p$ = 3.50, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 3, 4, 7, 41, 75}
Case 2. $\lambda_p$ = 7.22, $I$ = {1, 2, 3, 4, 5, 6, 7}, $\hat I$ = {1, 4, 7, 51, 64, 67, 68, 83}



Table 3. MSE, PE and their standard errors with n = 50, p = 100, S = 7

$\rho$ | Case | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
0.1 | 1 | 0.0052(0.0242) | 0.2929(0.3877) | 0.2580(0.0528) | 0.2612(0.0527) | 3.0195(0.6691) | 200/200
0.1 | 2 | 0.0104(0.0357) | 0.2347(0.1784) | 0.5135(0.1074) | 0.6430(0.1282) | 5.921(0.4172) | 200/200
0.3 | 1 | 0.0070(0.0289) | 0.4067(1.6692) | 0.2732(0.0590) | 0.3324(0.0735) | 5.6406(1.8289) | 200/200
0.3 | 2 | 0.0163(0.0458) | 0.5048(0.4107) | 0.4048(0.0881) | 0.5014(0.1078) | 6.4471(0.7697) | 200/200
0.5 | 1 | 0.0079(0.0336) | 0.4826(1.9425) | 0.2436(0.0551) | 0.3053(0.0674) | 5.8204(1.8152) | 200/200
0.5 | 2 | 0.0136(0.0512) | 0.1532(0.1835) | 0.3655(0.0841) | 0.4245(0.0914) | 6.4357(0.3262) | 200/200
0.7 | 1 | 0.0157(0.0602) | 0.2296(0.2970) | 0.2688(0.0580) | 0.3198(0.0711) | 6.6313(0.3560) | 200/200
0.7 | 2 | 0.0149(0.0637) | 0.1914(0.1420) | 0.2974(0.0624) | 0.3225(0.0672) | 7.5435(0.1169) | 197/200



From Table 3, we can see clearly that the correlation affects the performance
of the variable selection methods: the estimation gets worse with larger $\rho$.
However, the new method uniformly works much better than the Gaussian Dantzig
selector when we compare the performance of the methods with different values of
$\lambda_p$ and hence with different sub-models. We can see that in case 1 the sub-models
are more accurate than those in case 2, in the sense that they contain more of the
significant predictors we want to select. Then the estimation based on the Gaussian
Dantzig selector can work better, and so can the new method.

In the following, we consider data of higher dimension.

Experiment 3. In this experiment $\beta_{-I}$ is designed as in Experiment 1. Thus
the model here is also non-sparse. For very large p, the Dantzig selector
alone cannot work well. Thus, we use sure independence screening (SIS, Fan
and Lv 2008) to reduce the number of predictors to a moderate scale that is below
the sample size, and then perform the variable selection and parameter estimation
by the Gaussian Dantzig selector and our adjustment method.
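The screening step can be sketched as follows. This is a minimal illustration of marginal-correlation screening in the spirit of SIS, not the code used for the experiments: the function name and the retained model size `d` are hypothetical choices, and the columns are standardized before the componentwise statistics are ranked.

```python
import numpy as np

def sis_screen(X, y, d):
    """Keep the d predictors with the largest absolute marginal association.

    Columns of X are centered and scaled to unit standard deviation, y is
    centered, and predictors are ranked by |X_j^T y|.
    Returns the sorted 0-based indices of the retained columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0)
    omega = np.abs(Xs.T @ (y - y.mean()))   # componentwise marginal statistics
    keep = np.argsort(omega)[::-1][:d]      # indices of the d largest values
    return np.sort(keep)
```

The retained columns would then be passed to the Dantzig selector; choosing `d` below the sample size mirrors the description above.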

The experiment conditions are designed as:

$\beta_I = (1.0, -1.5, 2.0, 1.1, -3.0, 1.2, 1.8, -2.5, -2.0, 1.0)^T$, n = 100, p = 1000;

$\rho$ = 0.1:
Case 1. $\lambda_p$ = 4.50, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {1, 3, 5, 6, 7, 8, 9, 318, 514, 723, 760};
Case 2. $\lambda_p$ = 7.30, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {2, 3, 5, 8, 515, 886}.

$\rho$ = 0.5:
Case 1. $\lambda_p$ = 3.56, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {1, 2, 5, 7, 8, 9, 846, 878, 976};
Case 2. $\lambda_p$ = 6.92, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {2, 3, 5, 8, 10, 882, 963}.

$\rho$ = 0.9:
Case 1. $\lambda_p$ = 1.80, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {3, 5, 8, 10, 415, 432};
Case 2. $\lambda_p$ = 5.83, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {2, 3, 5, 114, 121, 839, 853, 882, 984}.

With this design, the $\lambda_p$ in case 1 results in more significant predictors being
selected into the sub-model than in case 2, so that we can see the performance
of the adjustment method.
Table 4. MSE, PE and their standard errors with n = 100 and p = 1000

$\rho$ | Case | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
0.1 | 1 | 0.7588(0.3497) | 71.4031(7.5501) | 6.8104(1.5485) | 8.0107(1.6574) | 94.7515(19.2968) | 200/200
0.1 | 2 | 0.8523(0.5343) | 122.8426(15.0952) | 13.1274(2.7772) | 16.0812(3.4160) | 189.7134(34.8081) | 200/200
0.5 | 1 | 3.6170(1.1823) | 104.8420(13.5089) | 9.9151(1.9902) | 11.2352(2.2316) | 133.4762(26.5058) | 200/200
0.5 | 2 | 3.4771(1.2683) | 92.3485(12.5122) | 11.6643(2.6704) | 12.7811(2.8941) | 134.3821(24.4896) | 200/200
0.9 | 1 | 5.9027(2.7039) | 107.6118(23.4383) | 8.2842(1.6181) | 11.3518(2.1745) | 148.3143(27.4828) | 200/200
0.9 | 2 | 3.8963(2.1760) | 59.1525(11.3152) | 10.8033(2.1411) | 12.9395(2.4835) | 68.7272(13.4061) | 200/200



From Table 4, we conclude that SIS does work to reduce the
dimension so that the Gaussian Dantzig selector and our method can be performed.
Whether the correlation coefficient is small or large (the values of $\rho$ range from
0.1 to 0.9), the new method works better than the Gaussian Dantzig selector. The
conclusions are almost identical to those obtained when p is much smaller in
Experiments 1 and 2, so we do not comment further here. In addition, by comparing the
results of case 1 and case 2, we can see that the adjustment can work better when
the sub-model is not well selected.

In the following we further check the effect of model size when the dimension is
larger. To do so, we choose n = 150, p = 2000, $\rho$ = 0.3.

For $\beta_I = (4.0, -1.5, 6.0, -2.1, -3.0)^T$, consider two cases:
Case 1. $\lambda_p$ = 3.45, $I$ = {1, 2, 3, 4, 5}, $\hat I$ = {1, 2, 3, 4, 5, 15, 1099, 1733};
Case 2. $\lambda_p$ = 8.36, $I$ = {1, 2, 3, 4, 5}, $\hat I$ = {1, 3, 554, 908}.

For $\beta_I = (4.0, -1.5, 6.0, -2.1, -3.0, 1.2, 3.8, -2.5, -2.0, 7.0)^T$, consider two cases:
Case 1. $\lambda_p$ = 3.02, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {1, 2, 3, 5, 7, 8, 9, 10, 1701};
Case 2. $\lambda_p$ = 9.08, $I$ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, $\hat I$ = {1, 3, 5, 7, 8}.
Table 5. MSE, PE and their standard errors with n = 150, p = 2000, $\rho$ = 0.3

S | Case | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
5 | 1 | 0.4245(0.2102) | 262.6392(21.2109) | 6.4015(1.3038) | 6.3439(1.2879) | 322.9945(62.6228) | 200/200
5 | 2 | 1.9510(1.0923) | 359.5838(32.4150) | 24.1959(4.8932) | 24.8013(5.1629) | 559.3584(98.1216) | 200/200
10 | 1 | 0.8799(0.5108) | 498.7862(59.0383) | 10.6009(2.3903) | 12.3505(2.6381) | 946.3400(175.1009) | 200/200
10 | 2 | 1.8524(0.7599) | 68.1862(43.3612) | 15.0471(2.8069) | 16.9161(3.1755) | 1623.4936(111.5972) | 200/200



The results in Table 5 show that SIS is again useful for reducing the dimension
so that the Gaussian Dantzig selector and our method can be applied, and furthermore
that the new method works better than the Gaussian Dantzig selector. On the other
hand, when the number of significant predictors is smaller, estimation is more
accurate, with smaller MSE and PE. In other words, when the number of significant
predictors is smaller, variable selection can perform better and the sub-model can be
more accurate (case 1 with 5 significant predictors).

Experiment 4. This experiment checks that, although our method is designed for
non-sparse models, it is also comparable to methods designed for sparse models
when the true model is indeed sparse. We again consider three types of $\beta$, which
are the same as those in Experiment 1 except that all components of $\beta_{-I}$ are
zero. The simulation results are reported in Table 6 below.
Table 6. MSE, PE and their standard errors with n = 50 and p = 100 for the sparse case

type | $\rho$ | MSE(stdMSE): $\hat\theta$ | MSE(stdMSE): $\tilde\theta_S$ | PE(stdPE): $\hat Y$ | PE(stdPE): $\hat Y_S$ | PE(stdPE): $\tilde Y_S$ | $\tau$
(I) | 0.1 | $0.9938\times10^{-3}$(0.0040) | $0.9324\times10^{-3}$(0.0037) | 0.0485(0.0114) | 0.0481(0.0113) | 0.0469(0.0109) | 71/200
(I) | 0.3 | 0.0013(0.0051) | 0.0033(0.0118) | 0.0668(0.0152) | 0.1373(0.0262) | 0.1440(0.0290) | 134/200
(I) | 0.5 | 0.0036(0.0128) | 0.0068(0.0239) | 0.1856(0.0429) | 0.2905(0.0603) | 0.2999(0.0640) | 138/200
(I) | 0.7 | 0.0066(0.0187) | 0.0100(0.0278) | 0.2485(0.0578) | 0.3288(0.0713) | 0.3311(0.0708) | 115/200
(I) | 0.9 | 0.1198(0.6479) | 0.1292(0.6619) | 0.3506(0.0758) | 0.4624(0.0881) | 0.4630(0.0867) | 99/200
(II) | 0.1 | 0.0010(0.0039) | 0.0010(0.0039) | 0.0482(0.0112) | 0.0479(0.0109) | 0.0468(0.0102) | 73/200
(II) | 0.3 | 0.0028(0.0105) | 0.0029(0.0110) | 0.1473(0.0315) | 0.1529(0.0324) | 0.1485(0.0330) | 80/200
(II) | 0.5 | 0.0029(0.0104) | 0.0030(0.0113) | 0.1462(0.0315) | 0.1526(0.0328) | 0.1496(0.0329) | 85/200
(II) | 0.7 | 0.0052(0.0160) | 0.0072(0.0209) | 0.2832(0.0626) | 0.3460(0.0736) | 0.3477(0.0743) | 114/200
(II) | 0.9 | 0.0059(0.0169) | 0.0220(0.0460) | 0.3360(0.0921) | 0.5392(0.1333) | 0.5250(0.1202) | 83/200
(III) | 0.1 | $0.9773\times10^{-3}$(0.0040) | $0.9425\times10^{-3}$(0.0039) | 0.0483(0.0108) | 0.0479(0.0108) | 0.0468(0.0105) | 71/200
(III) | 0.3 | 0.0034(0.0120) | 0.0060(0.0218) | 0.1697(0.0383) | 0.2547(0.0516) | 0.2629(0.0547) | 132/200
(III) | 0.5 | 0.0046(0.0142) | 0.0073(0.0215) | 0.2386(0.0573) | 0.3260(0.0700) | 0.3269(0.0679) | 122/200
(III) | 0.7 | 0.0076(0.0200) | 0.0114(0.0277) | 0.3633(0.0775) | 0.4997(0.1016) | 0.5030(0.1050) | 112/200
(III) | 0.9 | 0.0100(0.0229) | 0.0148(0.0321) | 0.4641(0.1067) | 0.6082(0.1264) | 0.6012(0.1224) | 83/200



From this table, we can see that even in sparse cases, for every type of $\beta$, the
new estimator $\hat\theta$ is in almost all cases better than $\tilde\theta_S$ in the sense of smaller MSE.
This is also the case for prediction: $\hat Y$ has a smaller prediction error than $\tilde Y_S$ does
when $\rho$ > 0.1. It is not surprising that $\hat Y_S$ cannot be as good as it is in the
non-sparse cases, but it is still comparable to $\tilde Y_S$. From the table, we can see that $\tilde Y_S$
is usually better than $\hat Y_S$ when $\rho$ is either 0.1 or 0.9, where $\tau$ < 0.5, whereas when
$0.3 \le \rho \le 0.7$ the prediction error of $\tilde Y_S$ is larger and $\tau$ > 0.5, except for the cases
with $\rho$ = 0.3, 0.5 in type (II) of $\beta$. Overall, the new method is still comparable to the
classical method in the sparse models under study.

In summary, the results in the six tables above show the superiority of
the new estimator $\hat\theta$, used with either the adjusted sub-model (3.3) or the sub-model
(2.4), over the others in non-sparse models, in the sense of smaller MSEs, PEs and
standard errors and a large proportion $\tau$. The good performance holds for different
combinations of the sizes of the selected sub-models (values of $\lambda_p$), n, p, S, I, $R^2$
and the correlation between the components of X. The new method is particularly
useful when a sub-model, as a working model, is very different from the underlying
true model. Thus, the adjustment method is worth recommending. It is also comparable
to the classical method in the sparse case, suggesting its robustness against model
structure. However, as a trade-off, the adjustment method involves nonparametric
estimation, although only a low-dimensional one. This makes the estimation less simple
than that of the existing methods. Thus, we may consider using it after checking whether
the sub-model is significantly biased. The relevant research is ongoing.

Supplementary Materials. 

Proofs of the theorems: The pdf file "supplement-l.pdf" containing detailed proofs
of the lemmas and theorems.

Matlab package for DANTZIG CODE routine: Matlab package "DANTZIG
CODE" containing the codes. (WinRAR file)